Accurate genome-wide genotyping from archival tissue to explore the contribution of common genetic variants to pre-cancer outcomes

Purpose The contribution of common genetic variants to pre-cancer progression is understudied due to long follow-up time, rarity of poor outcomes and lack of available germline DNA collection. Alternatively, DNA from diagnostic archival tissue is available, but its somatic nature, limited quantity and suboptimal quality would require an accurate cost-effective genome-wide germline genotyping methodology. Experimental design Blood and tissue DNA from 10 individuals were used to benchmark the accuracy of Single Nucleotide Polymorphisms (SNP) genotypes, Polygenic Risk Scores (PRS) or HLA haplotypes using low-coverage whole-genome sequencing (lc-WGS) and genotype imputation. Tissue-derived PRS were further evaluated for 36 breast cancer patients (11.7 years median follow-up time) diagnosed with DCIS and used to model the risk of Breast Cancer Subsequent Events (BCSE). Results Tissue-derived germline DNA profiling resulted in accurate genotypes at common SNPs (blood correlation r2 > 0.94) and across 22 disease-related polygenic risk scores (PRS, mean correlation r = 0.93). Imputed Class I and II HLA haplotypes were 96.7% and 82.5% concordant with clinical-grade blood HLA haplotypes, respectively. In DCIS patients, tissue-derived PRS was significantly associated with BCSE (HR = 2, 95% CI 1.2–3.8). The top and bottom decile patients had an estimated 28% and 5% chance of BCSE at 10 years, respectively. Conclusions Archival tissue DNA germline profiling using lc-WGS and imputation, represents a cost and resource-effective alternative in the retrospective design of long-term disease genetic studies. Initial results in breast cancer suggest that common risk variants contribute to pre-cancer progression. Supplementary Information The online version contains supplementary material available at 10.1186/s12967-022-03810-z.


Introduction
The study of the contribution of germline genetic variation to disease risk or treatment outcome typically requires blood or saliva samples as a source of constitutive DNA. Depending on the phenotype studied, such samples may not be banked and readily available. Samples Page 2 of 14 Nachmanson et al. Journal of Translational Medicine (2022) 20:623 may have to be prospectively collected, which hinders studies requiring long-term follow-up or obtained after contacting potential subjects of interest, which can be logistically and ethically challenging or impossible if a patient has relocated or died. In cancer research, there is a growing interest in directly profiling tumor tissue to obtain germline measures such as ancestry, polygenic risk, and HLA-typing [1]. Array-based genotyping followed by imputation from a reference population has been a standard method to genotype genome-wide SNPs in the human genome, but its compatibility with DNA obtained from archival tissue specimens remains to be established [2,3]. The approach can be challenging when the amount of tissue available for research is limited, which is often the case with surgical excisions of premalignant lesions or with most needle biopsies.
Recently low-coverage whole-genome sequencing (lc-WGS) has emerged as an attractive alternative to single nucleotide polymorphism (SNP) array by offering higher throughput at a reduced cost, reduced DNA input, and improved genotyping accuracy [4][5][6]. In fact, recent studies have shown the feasibility of using frozen tissue for germline profiling by imputing genotypes from off-target reads repurposed from tumor-targeted panel sequencing data, effectively equivalent to ultra-low coverage (less than 0.1×) whole-genome sequencing [1]. It is therefore likely that, in the absence of available targeted sequencing data, lc-WGS can be performed with DNA of lower quality and quantity to enable the imputation of germline variants from archival tissue specimens.
If accurate, such an approach could have important implications for the study of the contribution of inherited risk factors to the progression of premalignant disease. For many cancer types, the widespread adoption of cancer screening has led to an increase in the detection of premalignant lesions. Despite such efforts, screening has had limited impact on overall survival [7]. Clinical guidelines vary widely from watchful waiting or biopsy as for prostatic intraepithelial neoplasia to surgery and adjuvant treatment as for ductal carcinoma in situ (DCIS) of the breast [8,9]. In absence of reliable progression risk biomarkers and models, these interventions may have deleterious consequences at the two clinical extremes: delay in life-saving treatment or complications from overtreatment. DCIS is the most common breast cancer-related diagnosis, comprising ~ 20% of annual cases in the U.S. [10]. In breast disease, factors that impact the risk of breast cancer subsequent event (BCSE), defined as an in situ or invasive breast cancer neoplasm developed at least 6 months after treatment of a DCIS diagnosis, include age, size, grade of the lesion, hormone receptor status, and molecular profile. Their combined effect in risk models such as the University of Southern California/Van Nuys Prognostic Index has not resulted in any reliable BCSE risk prediction model and additional, more in-depth molecular and histological characterization is needed [11][12][13][14][15][16].
Given the independence between DCIS and associated BCSE in upwards of 20% of cases as evidenced by molecular studies comparing genomic profiles of initial DCIS and subsequent ipsilateral BCSE, systemic risk factors need to be considered in addition to those related to the index lesion [17]. While penetrant germline pathogenic variants exist and represent strong risk factors in breast cancer susceptibility genes such as BRCA1, BRCA2, CHEK2, PALB2, and PMS2, they are only present in 1.5% of all women [18]. Meanwhile, population-based genome-wide association studies (GWAS), have identified multiple common variants associated with lifetime risk of invasive breast cancer (IBC) [2,19]. The same SNPs have also been associated with risk of DCIS demonstrating the shared genetic susceptibility for IBC and DCIS [20]. It is however unclear if these SNPs are also associated with DCIS progression. Polygenic risk scores (PRS) derived from the allelic burden of risk-associated SNPs are now being added to common breast cancer risk models, significantly improving their performance, with individuals in the top percentile having a three-to fivefold increase in lifetime risk relative to women with risk scores in the middle quintile of those studied [21,22]. It is thus possible that DCIS patients with elevated breast cancer PRS are also at higher risk of BCSE and the addition of PRS could improve DCIS prognostic models akin to lifetime breast cancer risk models. Since BCSE can occur years after the initial DCIS diagnosis and is uncommon-observed in 10 to 25% of patients after 10 years, depending on treatment and known risk factors-a retrospective study is much more feasible for the purposes of validation [23,24]. Formalin-fixed paraffin-embedded (FFPE) tissue (referred to as archival tissue) from the DCIS biopsy or resection are therefore the only source of genetic material available and their validity for genome-wide genotyping of germline variants would be critical to the feasibility of such study.
Here we evaluate the validity of repurposing archival tissue specimens for germline genetic studies. We performed lc-WGS and imputed genotypes for 10 pairs of matching blood and tumor tissue samples to benchmark the accuracy for calling genome-wide genotypes, HLA haplotypes, and for implementing PRS. The reported results indicate the high accuracy of germline genotypes and haplotypes obtained from archival tissue DNA. Using this methodology we present a use-case measuring breast cancer PRS in 36 DCIS patients and explore its association with BCSE.

Concordance of lc-WGS imputed genotypes between blood DNA and FFPE tissue DNA
In order to establish the analytical validity of FFPE tissue DNA for germline genotyping and genotype imputation from lc-WGS, we selected 10 subjects including from European, African, and Asian ancestries with matching FFPE tissue and whole blood. Given the lack of matching blood specimens for most DCIS archival tissue in this retrospective study, we chose to perform this benchmarking using specimens from lung cancer patients, which should not affect the overall conclusions of this analytical validation. The archival tissue blocks were between 3 and 9 years old and yielded between 5 and 176 ng of DNA, which was then prepared for sequencing with a low-input protocol (see "Materials and methods"). Mean coverage depth was 0.92× (range 0.68-1.41×) and 0.7× (range 0.44-0.97) for blood and tissue, respectively. Genotypes were imputed using a Gibbs sampling method specifically designed for lc-WGS, which leverages haplotype reference panel information (1000G 30× NYGC reference panel-N = 3202 individuals; see "Materials and methods") [5,25]. Overall genotypes were imputed for 61,715,567 SNPs in each of the 20 samples, of which 43,274,690 (70.1%) were considered high quality (Impute INFO score > 0.80) [26]. Genotype concordance between blood and tissue increased with the minor allele frequency (MAF) of the variant in the global population. For SNPs with MAF of 0.1 or more, the aggregate r 2 was greater than 91% for all SNPs, and greater than 94% for high-quality SNPs (Fig. 1a). The concordance between blood and tissue was not lower for the two individuals who were non-white (Additional file 1: Figure S1). In contrast, the concordance was lower (87% at SNPs with MAF greater or equal to 0.1) when the sequencing coverage depth of the tissue DNA was lower (Additional file 1: Figure S2). Overall, the strongest discordance between blood and tissue was observed for SNPs at MAF lower than 0.01 which are typically imputed with decreased accuracy irrespective of the sample type [27].
The presence of somatic mutations and copy number alterations (CNA) in DNA from malignant cells has the potential to decrease local imputation accuracy. In particular, CNA may play a larger role than somatic mutations, as recently reported [1]. We estimated the effect of CNA status on the genotype concordance between blood and tissue across SNPs located in DNA regions that are copy neutral, in a copy gain, or in a copy loss. The studied DNA samples had, on average, 15% of the genome (range 0 to 65%) involved in CNA while no CNA was detected in the blood (see "Materials and methods", Additional file 1: Table S1). Common SNPs (MAF ≥ 0.1) located in copy neutral or copy gain regions had a remarkable blood-tissue genotype concordance r 2 higher than 95%, while those in regions of copy number loss showed lower concordance r 2 of 83% (Fig. 1b). The decreased imputation accuracy in areas of copy number loss can likely be explained by the decrease in allele-specific coverage depth, resulting in missed heterozygotes or a sparser scaffold for imputation. Fig. 1 Assessment of genome-wide concordance of lc-WGS imputed genotypes in tissue versus blood of N = 10 patients. a, b Genome-wide concordance (Pearson correlation coefficient squared-y-axis) of allele dosages across all genotyped SNPs between blood and tissue as a function of their minor allele frequency (MAF, x-axis). Concordance was calculated for each individual and each filtering category including genotype imputation quality (a) with all genotypes shown in light green and high-quality genotypes (INFO > 80) in dark green, and copy number status of high-quality genotypes in tissue (b), from SNPs located in a region that was copy neutral (orange), gain (red) or loss (blue). For any given bin corresponding to a patient, MAF and filtering category had to have a minimum of 1000 SNPs to be included. Error estimates from 95% confidence intervals computed from 1000 bootstrapping iterations are indicated as shaded areas We conclude that tissue-derived genome-wide genotypes faithfully represent germline profiles obtained from blood, especially at SNPs frequent in the population (MAF ≥ 0.1). Discrepancies between tissue and blood can be explained by decreased coverage depth caused by technical (insufficient sequencing) or genetic (copy number loss) limitations and mainly affecting rare SNPs (MAF < 0.01). These lower frequency SNPs are less likely to reach statistical significance in GWAS studies unless they have extreme effect size and therefore are rarely incorporated into PRS models. Taken together, the results suggest the feasibility of using archival tissues as a source of constitutive DNA in genetic studies relying on common SNPs.

Concordance of tissue-derived PRS
We next sought to further validate the performance of tissue-based genome-wide genotyping to accurately estimate PRS in individuals. Germline variants can be used to estimate disease risk in individuals by summing the effects of previously identified risk alleles carried by an individual into a personalized PRS. The clinical utility of PRS is currently being evaluated in multiple settings, including breast cancer screening and surveillance, where elevated PRS can be included in lifetime risk models [28]. The ability to accurately estimate PRS retrospectively, using archival tissue DNA, would greatly improve the ability to conduct large retrospective studies with long-term outcomes. We investigated multiple PRS derived from GWAS of susceptibility to 16 cancer types, and 6 non-cancer phenotypes [29][30][31]. We computed a tissue and blood-derived PRS for 10 individuals (see "Materials and methods") using the imputed genotypes from lc-WGS sequencing data described above. Overall 93% (2744 of 2962) of PRS single-nucleotide variant sites were successfully imputed, 84% of which were high quality (Additional file 1: Table S2). In each of the 16 cancer types, the tissue-derived PRS closely matched the bloodderived PRS, evidenced by high correlation coefficients (r ≥ 0.9) in 12/16 of the PRS (Fig. 2a). We saw similar results when evaluating PRS for non-cancer phenotypes, with 4/6 being highly correlated (r ≥ 0.9) (Fig. 2b). Differences between PRS in blood and tissue were associated with decreased tumor genome coverage (r = − 0.26, p = 0.02), but not with copy number loss (Additional file 1: Figure S3). Overall we report that archival tissue DNA profiled with lc-WGS resulted in a reliable PRS estimate in an individual and preserved relative ranks in a cohort enabling studies such as the use case presented below.

Contribution of breast polygenic risk scores to DCIS prognosis
We next demonstrated the utility of lc-WGS to investigate the contribution of breast cancer PRS to predict breast cancer subsequent events (BCSE-in situ or invasive, irrespective of laterality) after a DCIS diagnosis using a retrospective study design (Fig. 3a). We assembled a cohort of patients diagnosed with pure DCIS (N = 31 cases) who were then diagnosed with a BCSE at least 6 months after the DCIS diagnosis. We then complemented this cohort with a set of patients (N = 19 controls) diagnosed with pure DCIS who did not develop a BCSE for at least 5 years. A median of 51.2 ng (range 6.6-300) of DNA was extracted from the primary DCIS FFPE specimen archived between 6 and 25 years (Additional file 1: Table S3). The extracted DNA was sequenced to an average coverage depth of 0.89× (range 0.2-1.8×) (see "Materials and methods"). Fourteen out of 50 (28%) samples yielded insufficient coverage (N = 5) or had evidence of contamination with another patient (N = 9) and were excluded, leaving 22 cases and 14 controls for analysis (Table 1). Both cases and controls were equally affected (9/31 vs 5/19, p = 1; Fisher Exact Test). The median time to BCSE was 6.2 years (min: 1.4, max: 10.9), and patients without BCSE had a median time to followup of 11.7 years (min: 6.7, max: 19.6). Cases and controls were approximately matched for age, ancestry, DCIS size, grade, and ER status (Additional file 1: Table S4). We then performed imputation as described earlier which resulted in high-quality genotypes at a total of 27,605,021 SNP loci. In order to evaluate the relationship between breast cancer PRS and DCIS prognosis, we evaluated two recent and well established breast cancer PRS, measuring risk for overall breast cancer, consisting of 526 total and 270 unique sites (see "Materials and methods", Additional file 1: Table S2) [22,32]. We computed PRS for each patient, and compared groups with and without BCSE (Fig. 3b) (see "Materials and methods"). Patients with BCSE showed higher PRS values across both (mean 1.15× fold increase, minimum p = 0.05). We next measured the prognostic value of PRS in a multivariate Cox proportional hazard model to account for other risk factors previously associated with DCIS progression such as age, DCIS size, histological grade, and ancestry. We found that both breast cancer PRS had highly similar hazard ratios (mean 2.07), but found only one to have significant (q < 0.01) impact on BCSE risk, with the hazard ratio of 2.07 (95% CI 1.2-3.6, q = 0.0188), respectively (Fig. 3c, Additional file 1: Figure S4). Adding PRS to the model slightly improved the discrimination between patients with and without BCSE raising the mean C-index from 0.60 to 0.63 (Fig. 3d). In contrast, none of the six non-cancer PRS contributed significantly to the BCSE prognosis, indicating that the effects observed are likely specific to the underlying genetic risk specific to breast cancer (Fig. 3e). Using an extended Cox regression model, accounting for different incidence of cases and controls (see "Materials and methods"), we estimate that 10 years post-DCIS diagnosis, approximately 28% of patients with the highest decile of breast PRS will have a BCSE, as opposed to approximately 5% of patients with PRS in the lowest decile (Fig. 3f ).
(See figure on next page.) Fig. 3 Breast cancer polygenic risk score in DCIS patients with and without a breast cancer subsequent event. a Schematic overview of the study design. Treatment consisted of surgery and adjuvant radiation or endocrine therapy. b Comparison of breast cancer PRS score distribution between patients with (red) or without (black) a breast cancer subsequent event (BCSE). Dashed vertical lines represent mean normalized PRS values for each respective group. A logistic regression was constructed with each PRS and DCIS size, grade, age and ancestry for outcome of 5-year recurrence, the resulting Pseudo-R-squared, p-value and Bonferroni corrected q-values are listed for each PRS. Distributions were generated using kernel density estimates of histograms. c Forest plot representation of hazard ratios (square) and 95% confidence intervals (error-bars), for each tested breast cancer PRS, obtained from a Cox Proportional-Hazard model accounting for DCIS size, grade, and age, the ancestry of the patient (Additional file 1: Figure S4). The dotted line represents a log hazard ratio of 1, or having no effect on the outcome. The q-values represent Bonferroni corrected p-values for the effective number of tests. Significant hazard ratios (q < 0.05) are indicated in bold text. d Evaluation of discrimination of Cox proportional hazard model for BCSE vs no BCSE outcome using Harrel's C-index (y-axis) for models only using available risk factors versus available risk factors and breast cancer PRS, colored by the significance of hazard ratios for breast PRS (q < 0.05, light green). e Same as c but for non-cancer PRS. f Cox proportional hazard estimate of breast cancer subsequent event (BCSE)-free survival for two PRS over time in years. Curves are obtained by varying PRS (solid colored lines from blue as lowest and red as highest PRS percentile), as compared to each model baseline (dashed line) while keeping all other covariates the same. Each case and control was weighted by the epidemiological incidence of BCSE treated with surgery and endocrine therapy (15% at 10 years) [24]  Even with this limited dataset, there is a suggestive contribution of pre-established breast cancer PRS in DCIS prognosis, though this will require validation in a larger independent cohort. Independent of its possible clinical significance, and acknowledging the need for additional validation of the results, the presented use case demonstrates the feasibility of using DNA from tissues archived for decades to associate germline genetic factors with long-term patient outcomes and gain new insight into disease etiology and progression.

Imputation of HLA-gene alleles from lc-WGS
In addition to SNP genotyping, we next investigated whether lc-WGS of archival tissue could be used to determine the haplotypes of the various HLA genes. HLA genes are some of the most polymorphic genes in the human genome and the major histocompatibility complex plays a critical role in antigen presentation to the immune system, particularly in tumorigenesis [33][34][35]. Using samples collected from 14 patients, including 10 patients with both blood and tissue DNA available, we compared HLA-alleles imputed from genome-wide genotypes obtained from lc-WGS against the results of clinical-grade deep targeted sequencing of the HLA locus from matching blood DNA samples (referred to as gold standard-see "Materials and methods"). Alleles for Class I (HLA-A, B, C genes) and Class II (DRB1, DQB1 genes) were imputed using QUILT-HLA against the 1000G reference panel [6]. Overall 4 field allele calls from blood DNA were 92.8% (78/84) and 80.4% (45/56) concordant with the gold standard for Class I and Class II genes respectively (Fig. 4a). At a lower 2 field resolution, the concordance was 97% for Class I and 91% for Class II (Additional file 1: Figure S5). The decreased accuracy for HLA Class II, particularly for DRB1 likely reflects the increased diversity of these loci in comparison to Class I as well as the presence of pseudo-genes which may introduce ambiguity in the alignment of short sequence reads [36]. In order to evaluate the effect of DNA source on HLA-typing accuracy, we compared tissue-derived HLA types to the gold standard. We found 4 field allele calls from tissue were 96.7% (58/60) and 82.5% (33/40) concordant with the gold standard blood HLA-typing, for Class I and Class II respectively (Fig. 4b). In 49/50 comparisons between blood and tissue, tissue-derived samples provided as accurate calls, suggesting that the DNA source did not have an impact on imputation quality (Fig. 4c). Overall, HLA-types that did not match the gold standard had worse imputation quality as reflected by their lower posterior probabilities (Fig. 4d). The high accuracy of HLA-typing from lc-WGS as well as the consistent results between blood and tissue-based DNA demonstrates that remarkably, imputed HLA-types from lc-WGS on archival tissue are comparable against deep targeted HLA sequencing on blood, with a fraction of the required DNA input and a streamlined protocol.

Discussion
Here we rethink the traditional design of germline genetic studies by answering the question, when typical DNA sources such as blood, saliva or urine are unavailable, can we extract the same information from archival tissue specimens? Often collected for histological examination and diagnosis and then stored indefinitely, these samples offer an abundant source of genetic material from patients with potentially long clinical followup. By using lc-WGS and recent advances in genotype imputation, we compared the concordance of germline genotypes obtained from blood DNA and archival tissue DNA in 10 different individuals. Archival tissue faithfully represented the germline profile of common SNPs obtained from blood both at the genome-wide level and across well-established PRS. Beyond concordance at the SNP-level, we also demonstrated accurate genotyping at highly polymorphic HLA alleles. To our knowledge, we present the first evidence that HLA-typing using lc-WGS from archival tissue is as accurate as true clinicalgrade HLA-typing. The restriction to lung tissue in the comparison may have some limitations. The tissue was on average more recent than the one used for the breast cancer study and therefore results in higher quality DNA and overestimation of the accuracy of the approach. This would however not affect the genotypes of a particular patient, but rather result in a larger number of non-evaluable samples due to overall assay failure. Lung cancer may also have increased mutational burden or genomic instability resulting in a possible underestimation of the accuracy compared to breast DCIS. A more comprehensive collection of various tissue types, grades, age or fixation conditions would be needed to fully understand the impact of histological variables on archival tissue-based genotyping. Nevertheless, our results support the future utilization of archival tissue to construct large retrospective studies to characterize the role of germline variants in disease etiology, progression, and treatment. The use of archival tissue as a source of constitutive DNA will enable a wealth of retrospective studies by repurposing specimens archived by most clinical sites to help address the genetic underpinnings of disease with long-outcome, such as the progression of pre-malignant lesions as presented here. Such studies would either require long follow up after the initial sample collection, or a massive and costly effort to retrospectively collect blood or saliva samples. In contrast, provided the subjects have been offered diagnostic biopsies, or surgical treatment, the course of their clinical care or study participation, their left-over specimen can be used to enable post-hoc genetic analysis. Such studies would require approval of the Institutional Review Boards (IRB) and, since 2015, informed consent needs to be explicit about the use of specimens and data for genetic research and the risk for privacy it entails [37]. Commonly, IRBs waive the requirement for consent from patients deceased or lost to follow-up, however, such data needs to be distributed with caution and typically protected by a Data Access Policies the researcher has to comply with. As such, while our approach can enable large retrospective genetic studies where informed consent may be waived, the eligibility of each patient, and the overall data sharing policy need to be carefully considered.
Our report includes the application of the approach to interrogate the contribution of genetic factors to breast DCIS progression. The relatively good outcome of the disease poorly justified a thorough collection of risk variables, especially those related to inherited risk. However, overtreatment of DCIS, and its harms, is being increasingly acknowledged and systematic reviews of clinicopathological factors have not resulted in reliable models of progression [11,12,38]. Most epidemiological studies need to be large due to the slow progression and rarity of poor outcomes and rely exclusively on medical chart review [24,39,40]. As such, additional factors that are hard or impossible to collect from the charts such as mammography or magnetic resonance imaging, digital pathology, or germline inherited factors have not been as thoroughly and systematically investigated. We made the narrow hypothesis that lifetime breast cancer susceptibility-which can be seen as progression from normal to malignant epithelium-and progression of DCIS share the same genetic risk factors. We tested this hypothesis by measuring breast PRS in a small cohort of carefully selected DCIS subjects using our approach. Interestingly, the effect size observed (HR = 2) is higher than the one observed for other risk factors to DCIS progression, suggesting PRS could significantly improve previous risk models [22,41,42]. Thanks to the accurate PRS estimate obtained from left-over surgical specimens, we were able to see that germline variation likely contributed significantly to the DCIS progression to an extent similar or greater to previously investigated risk factors such as grade, age, and Her2 overexpression [38]. The case-control design of the study is suboptimal and the findings would need to be validated in a larger cohort of unselected patients, where a more comprehensive set of covariates would be accounted for, including treatment. Subsequent larger studies would also be important to evaluate competing risk models for subsequent in situ versus invasive disease, or laterality of the event, where PRS may contribute more in particular contexts. The modest cost and relative experimental simplicity of our approach, accompanied by a state-of-the-art imputation strategy can likely be scaled up provided diagnostic sections or left-over specimens can be found. Several large DCIS cohorts are generating mutational profiles, including some with lc-WGS and associated with clinical outcomes, which would be particularly suitable for validation in the future [17,43].
In the study of malignant progression as well as the onset and progression of multiple other diseases, the overactivity or inactivity of the immune system represents a key factor. A large contribution of variation in immune traits is inherited and yet the role of this contribution in disease progression is poorly understood [44,45]. In particular, the genetic diversity of the MHC, one of the most polymorphic regions of the genome, is a real challenge to study the role of the adaptive immune system. In the context of tumorigenesis, the failure of the major histocompatibility complex (MHC) to present antigens to the immune system is being increasingly recognized as contributing to cancer immune evasion and failure to respond to immune checkpoint inhibitors [46][47][48]. The determination of the HLA haplotypes, encoding the MHC is typically limited to the setting of organ or bone marrow transplants and not typically performed in other epidemiological studies. Recent reports however show the importance of the HLA-type in understanding the exposed mutanome, and its consideration can have important predictive value in the context of immunotherapies [33,34,49]. But with a lack of systemic HLA-typing or absence of genetic material to do so, such studies are hard to replicate or scale-up. To address this, we demonstrated that we can assign 4 field alleles to HLA-A, B, C, and DRB1, DQB1 genes by reference informed imputation of lc-WGS data [6]. These imputed HLA-types had comparable accuracy to deep targeted sequencing of the HLA locus with a fraction of the required DNA input (5 vs 40,000 ng) and with a simplified protocol (no need for targeted capture). The improvement in both sample requirement and throughput to HLA-typing supports the evaluation in lc-WGS with imputation in replacing current clinical standard tests. While offering many benefits, there are still some limitations to lc-WGS paired with imputation for germline profiling of archival tissue. Similar to previous reports benchmarking lc-WGS imputation, error increases with decreasing minor allele frequency [5,6]. This would preclude the use of this strategy for the identification of rare variants of high penetrance associated with familiar risk (BRCA , Lynch, or Li-Fraumeni syndromes). Similarly, genotypes from rare risk-associated SNPs or HLA-types only found in small populations would be more likely missed by this approach. In the future, the availability of even larger and more diverse reference populations may help mitigate this effect. For the purposes of this study we utilized the unrestricted 1000G reference panel (N = 3202 haplotypes), however larger extensive, though restricted, panels such as Haplotype Reference Consortium (HRC) (N = 64,976) or TopMed (N = 53,831) exist [25,50,51]. Low coverage depth represents an additional limitation of our approach. While a restricted number of reads sequenced from a WGS library can result in decreased imputation accuracy, another source of tumor-specific decreased coverage is somatic copy number alterations (CNA). We observed that regions in a copy number loss resulted in decreased imputation accuracy. Similar observations were recently reported in a study performing germline imputation from discarded reads from targeted-sequencing tumor-derived tissue [1]. Here the choice of the tissue source, or the possibility to dissect normal histological regions, can help mitigate these effects. Indeed the use of adjacent normal tissue, pre-malignant or low-grade lesions or even lymphocytic aggregates, or lymph node specimens would enrich for diploid cells resulting in fewer inaccurate genomic regions. In contrast, imputation in high-grade lesions or invasive tumors with prominent aneuploidy needs to be carefully considered and may be mitigated in the largest dataset where available CNA profiles could be used as prior information in the imputation strategy.

Conclusion
In conclusion, our study demonstrates that archival tumor tissue is an appropriate DNA source to measure germline genetic variation in lieu of normal tissue or blood. By shallow sequencing of the genome, and imputing missing sequences using haplotypes from thousands of individuals, the resulting genotypes, particularly for common SNPs and HLA alleles between blood and archival tissue were quite comparable. Especially in the study of slow progressing or rare diseases which may have been logistically unrealistic due to a long time to events and large sample numbers required, this framework has the potential to enable very large retrospective genetic studies, driving both basic research and translational discoveries.

Patient selection
For the tissue-blood benchmarking study, a total of N = 14 Lung adenocarcinoma cancer patients with available tumor tissue and matching buffy coat in N = 10 were selected from the Moores Cancer Center Tissue and Technology Shared Resource (BTTSR).
For the DCIS PRS study, a total of 50 patients were originally selected from the UC San Diego ATHENA DCIS registry-a retrospective registry approved by the UCSD and UCSF IRB. Case patients with BCSE were first selected on the basis of time to BCSE, surgery type, care location, and availability of archival tissue blocks. Control patients were then selected from patients without BCSE, with long follow-up time and matching cases for risk factors including age at DCIS, ancestry, DCIS grade, DCIS size, treatment type, ER, and Her2 status when available (Table 1).

Sample preparation
Blood DNA was extracted from 50 µL of buffy-coat using DNAeasy blood and tissue kit (Qiagen). Tissue blocks were sectioned in 5 µm scrolls and 3 to 5 scrolls were used to extract DNA with Covaris FFPE truXTRAC FFPE tNA kit using M220 Covaris Focused UltraSonicator

Low-coverage whole genome sequencing (lc-WGS)
Between 5 and 300 ng of DNA was used as input for the library preparation using NEB Ultra II FS library preparation kit (New England Biolabs), which combines enzymatic fragmentation with end-repair and A-tailing in the same tube.

Imputation of genotypes from lc-WGS
Genome-wide genotypes were imputed using lc-WGS specific method GLIMPSE (v1.1.1) with the hg38 version of 1000G 30× NYGC reference panel (N = 3202 individuals) [5,25] Phasing and imputation were performed directly on BAM files in individual chunks of each chromosome using "GLIMPSE_phase", and then the imputed variants were subsequently ligated together for each chromosome using "GLIMPSE_ligate". We note that short insertions and deletions were excluded from any analysis as these are currently unreliable from lc-WGS and not currently imputed by the strategy implemented [1].

Measuring imputation concordance
Imputation concordance between samples was summarized using squared Pearson correlation values obtained from the bcftools "stats" function (v1.9), which captures the correlation between allele dosages of variants in each minor allele frequency (MAF) bin [58]. Variants across all the autosomes were used in genome-wide benchmarking performance and all chromosomes for PRS evaluations.

Copy number analysis
Copy number alterations (CNAs) were called using CNVkit (v0.9.9) in "wgs" mode, average bin size was set at 100,000 bp [59]. A set of unrelated normal tissues sequenced with the same protocol were used to generate a panel of normals used during CNA calling. Any bins with a log 2 copy ratio lower than − 15, were considered artifacts and removed. Breakpoints between copy number segments were determined using the circular binary segmentation algorithm (p < 10 − 4 ). Copy number genomic burden was computed as the sum of sizes of segments in a gain (log 2 (ratio) > 0.3) or loss (log 2 (ratio) < − 0.3) over the sum of the sizes of all segments.

Measuring PRS
Polygenic risk scores (PRS) were computed using the following equation: . PRS is computed as a function of β i which is the per-allele log odds ratio, or beta coefficient for the risk SNP allele i, and x i is the dosage of the risk allele i {0,1,2}, and n is the total number of SNPs composing the PRS. PRS scores were then scaled using z-score transformation. PRS sites and effect weights were all obtained from the Polygenic Score (PGS) Catalog [61]. The catalog numbers and descriptions of each PRS are listed in Additional file 1: Table S2.

Cox proportional hazard model construction for breast PRS
Cox proportional hazard models were constructed nonparametrically, using Breslow's method with robust estimates in lifelines survival analysis package in Python [62].
As the case-control study design does not reflect true DCIS population incidence, we gave each case a weight of 1, and each control a weight of 1 p , with p being the proportion of controls sampled out of the total needed to reflect epidemiological incidence of BCSE treated with surgery and endocrine therapy (15% at 10 years) [24,63,64]. A separate model was constructed for each of the six evaluated breast PRS, in order to measure the effect on risk of BCSE, by PRS, DCIS nuclear grade, age of the patient at diagnosis, the size of the lesion, and whether the ancestry of the individual was European or not. Each covariate was tested for violation of the proportional hazards assumption. The 5 samples missing lesion size, were excluded from the model. In the 6 DCIS samples missing grade, grade was assigned on the basis of the tertiles of copy number burden distribution observed in the cohort since grade and copy number burden are highly correlated [14].

Multiple hypothesis correction for non-independent PRS
In order to perform multiple hypothesis correction on multiple non-independent PRS, such as the breast PRS, we implemented the Li and Ji method in R package meff to estimate for the effective number of tests performed [65,66]. The effective number of tests was then used to generate Bonferroni corrected p-values, labeled as q-values.
Additional file 1: Figure S1. Effect of patient ancestry on lc-WGS imputation concordance between blood and tissue. Comparison of concordance between blood and tissue-based on ancestry background of the patient, with White ancestry in light green and Black or Asian ancestry in dark green. Pearson correlation squared (r 2 ) is for all aggregated SNPs within a MAF bin. When available, 95% confidence intervals are shaded around the line based on 1000 bootstrap iterations. Figure S2. Effect of coverage-related features on lc-WGS imputation concordance between blood and tissue. (a, b) Comparison of concordance as measured by squared Pearson correlation (y-axis) between blood and tissue as a function of MAF (x-axis) based on mean sequencing genome coverage depth of blood (a) or tissue (b). Figure S3. Effect of coverage-related features on the error in non-cancer PRS calculation. (a, b) PRS error (y-axis), as measured by the absolute difference between blood and tissue PRS across all non-cancer PRS, as a function of (a) fraction of genome in a copy number loss or (b) mean tissue/tumor genome coverage (x-axis). The 95% confidence intervals are shaded around the line based on 1000 bootstrap iterations. Spearman correlation coefficient, r, and corresponding p-value are indicated as text in the upper right. Figure S4. Cox proportional hazard models measuring BCSE outcome in DCIS patients for 6 breast cancer PRS. Forest plot representation of hazard ratios (square) and 95% confidence intervals (error-bars), for each normalized breast cancer PRS and covariates for DCIS BCSE risk including DCIS nuclear grade (Grade), age of the patient at diagnosis (Age), the size of the DCIS lesion (Size), and whether the ancestry of the individual was European (EUR). The dotted line represents a hazard ratio of 1, indicating no effect on BCSE risk, > 1 indicating increased, and < 1 indicating decreased risk. Figure S5. Assessment of 2 field HLA-typing accuracy from lc-WGS. Number of concordant HLA alleles (0: white, 1: grey, 2: black) between haplotypes from the clinical gold standard and those imputed using QUILT-HLA for class I (A, B, C) and class II (DQB1 and DRB1) HLA genes (rows) using blood DNA of 14 patients. Table S1. Description of the studied samples. Table S2. PRS description. Table S3. DCIS cohort technical characteristic description. Table S4. DCIS cohort covariate association with patient outcome.